4 - Estimation

In this lecture, we discuss parametric and non-parametric estimation. We also build on what we learned in past lectures to address challenging estimation tasks like estimating parameters for a mixture of Gaussians.

4.1 Parametric Estimation

We first discuss estimation of parametric models (or distributions) of the data, which amounts to estimating the unknown parameter(s). A distribution is said to be parametric if we can fully describe it by a finite set of parameters. Many of the distributions we have seen so far satisfy this property: they come from a family of distributions (e.g., Bernoulli) that is fully specified (that is, we know how to compute probabilities) once we specify the value of its parameter(s) ($p$ in the case of a Bernoulli).

Example 1 Some examples of parametric distributions include:

* $\mathrm{Bernoulli}(p)$, which is a discrete distribution on the set $\{0, 1\}$, fully specified by its parameter $p$: $\mathbb{P}[X = 1] = 1 - \mathbb{P}[X = 0] = p.$
* $\mathrm{Binomial}(n, p)$, which is a discrete distribution on the set $\{0, 1, \dots, n\}$, fully specified by its parameters $n$ and $p$: $\mathbb{P}[X = k] = {n \choose k} p^k (1-p)^{n-k}.$
* $\mathrm{Poisson}(\lambda)$, which is a discrete distribution on the set of non-negative integers $0, 1, 2, \dots,$ fully specified by its parameter $\lambda$: $\mathbb{P}[X = k] = \frac{\lambda^k e^{-\lambda}}{k!}.$
* The exponential distribution $\mathrm{Exp}(\lambda)$, which is a continuous distribution on the set of non-negative reals $[0, +\infty)$, fully specified by its parameter $\lambda$, as its probability density function (pdf) is given by $f_{\lambda}(x) = \lambda e^{-\lambda x}.$
* The Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, which is a continuous distribution on the set of real numbers $(-\infty, +\infty)$, fully specified by its parameters $\mu$ and $\sigma^2$, as its pdf is given by $\phi(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$
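To connect these formulas to computation, here is a minimal sketch (assuming NumPy and SciPy are available; the course does not prescribe any particular software) that evaluates a few of the pmfs/pdfs above directly from their formulas and checks them against `scipy.stats`. The parameter values are arbitrary choices for illustration.

```python
import numpy as np
from math import comb, exp, factorial, pi, sqrt
from scipy import stats

# Binomial(100, 0.3): P[X = 27] = C(100, 27) p^27 (1-p)^73
assert np.isclose(comb(100, 27) * 0.3**27 * 0.7**73, stats.binom.pmf(27, 100, 0.3))

# Poisson(2.5): P[X = 3] = lambda^3 e^{-lambda} / 3!
assert np.isclose(2.5**3 * exp(-2.5) / factorial(3), stats.poisson.pmf(3, 2.5))

# Exp(2.5) at x = 0.8: f(x) = lambda e^{-lambda x}  (scipy parameterizes by scale = 1/lambda)
assert np.isclose(2.5 * exp(-2.5 * 0.8), stats.expon.pdf(0.8, scale=1 / 2.5))

# N(1, 4) at x = 0.8: f(x) = 1/(sqrt(2 pi) sigma) e^{-(x - mu)^2 / (2 sigma^2)}
mu, sigma = 1.0, sqrt(4.0)
assert np.isclose(exp(-(0.8 - mu) ** 2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma),
                  stats.norm.pdf(0.8, loc=mu, scale=sigma))
```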

4.1.1 Maximum Likelihood Estimation (MLE)

A maximum likelihood estimator (MLE) is used to estimate the value of the unknown parameter(s) of a probability distribution from a known family of probability distributions. MLE is widely used in data science as it can usually be shown to have very good statistical properties. We will see some examples where the MLE can be computed in closed form and aligns with intuitive estimators, like the empirical estimator. However, since computing the MLE is in general an optimization problem, it is not always efficiently computable; this is its main limitation.

The goal of a maximum likelihood estimator is to determine the value of the parameter for which the observed data has the largest probability. We will make this more formal in a moment, but to do so it helps to look at concrete examples.

MLE for Discrete Distributions

MLE computes the value of the parameter for which "the data is most likely" by maximizing a likelihood function. For discrete distributions, the likelihood function is the conditional probability

$$ f(\text{parameter}) := \mathbb{P}[\text{data}|\text{parameter}]. $$

You should think about this conditional probability as: the probability of the data given the unknown parameter that fully specifies the distribution from the given family of distributions. Let's look at a specific example.

Example 2 Consider tossing a coin a hundred times and seeing 67 heads. You know that the number of heads you would see in 100 coin tosses is distributed according to a binomial distribution with parameters $n = 100$ and $p$, which is the unknown parameter that determines the probability of the coin landing heads up. In this case, you have that $$ \mathbb{P}[\text{data}|\text{parameter}] = \mathbb{P}[67 \text{ heads}|p] = {100 \choose 67} p^{67}(1-p)^{33}. $$ MLE tries to maximize this probability. Recalling what you learned in calculus classes, you know that if you want to maximize a function $f(p),$ you can compute its first derivative and set it equal to zero. Then if the second derivative at the same point is negative, the point you found is a local maximum. Let us try to do that for $f(p) = {100 \choose 67} p^{67}(1-p)^{33}.$ $$ f'(p) = {100 \choose 67} \Big(67 p^{66}(1-p)^{33} - 33 p^{67}(1-p)^{32}\Big); $$ $$ \quad \quad f''(p) = {100 \choose 67} \Big(67\cdot 66 p^{65}(1-p)^{33} - 67\cdot 33 p^{66}(1-p)^{32} - 33 \cdot 67 p^{66}(1-p)^{32} + 33 \cdot 32 p^{67}(1-p)^{31}\Big). $$ Observe that $f(0) = f(1) = 0$ while $f(p) > 0$ for $p \in (0, 1),$ so the maximum must be attained at some $p \in (0, 1).$ Setting $f'(p) = 0$ and restricting to $p \in (0, 1)$ (so that we may divide by $p^{66}(1-p)^{32} > 0$), we get that \begin{align*} &\; 67 p^{66}(1-p)^{33} = 33 p^{67}(1-p)^{32}\\ \Leftrightarrow &\; 67(1-p) = 33p\\ \Leftrightarrow &\; 100p = 67 \\ \Leftrightarrow &\; p = 0.67. \end{align*} Plugging $p = 0.67$ into the expression for $f''(p)$, we can verify that $f''(0.67) < 0.$ Thus, $f(p)$ is maximized for $p = 0.67,$ which gives us the maximum likelihood estimate. Observe that this estimate aligns well with the intuition that $p$ equals the fraction of the observed heads.

Instead of working with the likelihood function, it is often more convenient for calculations to work with the log-likelihood. This works because the logarithm is an increasing function, so maximizing the log-likelihood is the same as maximizing the likelihood. In the coin-tossing example above, the log-likelihood $g(p)$ equals $\log(f(p)),$ leading to

$$ g(p) = \log\Big({100 \choose 67}\Big) + 67 \log(p) + 33 \log(1-p). $$

Computing the first and the second derivative is slightly easier for $g(p)$, as we get rid of some of the large constants. In particular,

\begin{align*} g'(p) &= \frac{67}{p} - \frac{33}{1-p},\\ g''(p) &= - \frac{67}{p^2} - \frac{33}{(1-p)^2}. \end{align*}

Solving $g'(p) = 0$ for $p$, we get the same result, $p = 0.67.$ On the other hand, $g''(p) < 0$ for any $p \in (0, 1),$ and we can immediately conclude that $p = 0.67$ maximizes the log-likelihood (and, thus, maximizes the likelihood as well).
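As a sanity check on the calculation above, the following small sketch (NumPy is assumed here; the lecture itself does not rely on any code) evaluates the log-likelihood $g(p)$ on a fine grid and confirms that it peaks at the empirical fraction of heads, $p = 0.67$.

```python
import numpy as np
from math import comb, log

heads, n = 67, 100

def log_likelihood(p):
    # g(p) = log C(100, 67) + 67 log(p) + 33 log(1 - p)
    return log(comb(n, heads)) + heads * np.log(p) + (n - heads) * np.log(1 - p)

# Evaluate g on a fine grid over (0, 1) and locate the maximizer.
grid = np.linspace(0.001, 0.999, 9981)
p_hat = grid[np.argmax(log_likelihood(grid))]
print(p_hat)  # approximately 0.67, the empirical fraction of heads
```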

MLE for Continuous Distributions

For continuous distributions, using conditional probabilities as in the discrete case becomes problematic, as the probability of a continuous random variable taking any specific, fixed value is zero. Instead, we use the probability density function (pdf), conditioned on the parameter we want to estimate. The intuition is that we should view the specific data points that we collect as representing small intervals around the observed values. For instance, if we see a data point $X_i = 5,$ then the "probability" of seeing that point should be seen as the probability of getting a point from an interval $[X_i - \frac{\delta}{2}, X_i + \frac{\delta}{2}]$ for some small $\delta > 0.$ This probability for small $\delta > 0$ essentially becomes $\approx f(x = X_i|p)\delta,$ where $f$ is the probability density function from the specified family determined by the unknown parameter $p$.

Similar to the case of discrete distributions, it is often more convenient for calculations to work with the log-likelihood $g(p) = \log(f(x|p)).$

Example 3-Fitting a Spherical Gaussian to Data Suppose you have access to $n$ independent data vectors of dimension $d$ that you know came from a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I}),$ but you don't know the distribution parameters $\boldsymbol{\mu}$ and $\sigma^2$. The pdf of the Gaussian for a data vector $\mathbf{x}_i,$ $i \in \{1, \dots, n\},$ provided the unknown parameters are $\boldsymbol{\mu}$ and $\sigma^2$, is $$ f(\mathbf{x} = \mathbf{x}_i |\text{parameters} = \{\boldsymbol{\mu},\, \sigma^2\}) = \frac{1}{(\sqrt{2\pi}\sigma)^d}e^{-\frac{\|\mathbf{x}_i - \boldsymbol{\mu}\|_2^2 }{2\sigma^2}}. $$ Since all of the $n$ data vectors are independent, the pdf for all of them given the unknown parameters $\boldsymbol{\mu}$ and $\sigma^2$ (the likelihood function) is $$ \frac{1}{(\sqrt{2\pi}\sigma)^{dn}}e^{-\frac{\|\mathbf{x}_1 - \boldsymbol{\mu}\|_2^2 + \|\mathbf{x}_2 - \boldsymbol{\mu}\|_2^2 + \dots + \|\mathbf{x}_n - \boldsymbol{\mu}\|_2^2 }{2\sigma^2}}. $$ Taking the logarithm, we get that the log-likelihood function is $$ g(\boldsymbol{\mu}, \sigma^2) = - \frac{dn}{2}\log(2\pi) - \frac{dn}{2} \log(\sigma^2) - \frac{\|\mathbf{x}_1 - \boldsymbol{\mu}\|_2^2 + \|\mathbf{x}_2 - \boldsymbol{\mu}\|_2^2 + \dots + \|\mathbf{x}_n - \boldsymbol{\mu}\|_2^2 }{2\sigma^2}. $$ As discussed above, the MLE estimate for the unknown parameters $\boldsymbol{\mu}$ and $\sigma^2$ maximizes $g(\boldsymbol{\mu}, \sigma^2)$. Now $g$ is a multivariate function (of $d+1$ variables $\mu_1, \dots, \mu_d$ and $\sigma^2$), and we have not yet talked about how to find extrema of multivariate functions; we will do that later in the semester. For this particular function, it suffices to compute all partial derivatives and set them to zero. Recalling that $\|\mathbf{x}_i - \boldsymbol{\mu}\|_2^2 = \sum_{j=1}^d ((\mathbf{x}_i)_j - \mu_j)^2$ and taking all partial derivatives, we get \begin{align} \frac{\partial g}{\partial \mu_j}(\boldsymbol{\mu}, \sigma^2) &= \sum_{i=1}^n \frac{(\mathbf{x}_i)_j - \mu_j}{\sigma^2}, \quad j = 1, 2, \dots, d\\ \frac{\partial g}{\partial \sigma^2}(\boldsymbol{\mu}, \sigma^2) &= - \frac{dn}{2\sigma^2} + \frac{\|\mathbf{x}_1 - \boldsymbol{\mu}\|_2^2 + \|\mathbf{x}_2 - \boldsymbol{\mu}\|_2^2 + \dots + \|\mathbf{x}_n - \boldsymbol{\mu}\|_2^2}{2 (\sigma^2)^2}. \end{align} The first set of equations, when set to zero, can be made independent of $\sigma^2$ (by multiplying both sides by $\sigma^2$) and leads to $$ \hat{\boldsymbol{\mu}}_{\text{MLE}} = \frac{\mathbf{x}_1 + \mathbf{x}_2 + \dots + \mathbf{x}_n}{n}, $$ which is the empirical estimate of the mean. With this solution, we can then solve the last equation of the system above to get $$ \hat{\sigma}^2_{\rm MLE} = \frac{\|\mathbf{x}_1 - \hat{\boldsymbol{\mu}}_{\rm MLE}\|_2^2 + \|\mathbf{x}_2 - \hat{\boldsymbol{\mu}}_{\rm MLE}\|_2^2 + \dots + \|\mathbf{x}_n - \hat{\boldsymbol{\mu}}_{\rm MLE}\|_2^2}{dn}. $$
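The closed-form estimates above are easy to check numerically. Below is a minimal simulation sketch (NumPy assumed; the particular values of $n$, $d$, $\boldsymbol{\mu}$, and $\sigma^2$ are arbitrary illustrative choices) that draws data from a spherical Gaussian and computes $\hat{\boldsymbol{\mu}}_{\rm MLE}$ and $\hat{\sigma}^2_{\rm MLE}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10
mu_true = np.arange(d, dtype=float)   # arbitrary true mean vector
sigma2_true = 2.0                     # arbitrary true variance

# Draw n independent samples from N(mu_true, sigma2_true * I).
X = mu_true + np.sqrt(sigma2_true) * rng.standard_normal((n, d))

# MLE estimates derived above: the empirical mean, and the average squared
# distance to it divided by the dimension d.
mu_hat = X.mean(axis=0)
sigma2_hat = np.sum((X - mu_hat) ** 2) / (d * n)

print(np.linalg.norm(mu_hat - mu_true))  # small for large n
print(sigma2_hat)                        # close to sigma2_true = 2.0
```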

Observe that once again the MLE aligns with an intuitive (empirical) estimator for this problem, which is the reason why MLE is commonly used in parametric estimation. Note also that the above estimate of the variance (as you also saw in HW 2) is not unbiased. To get an unbiased estimator, we would need to divide by $d(n-1)$ in place of $dn$. However, when $n$ is large, there is not much difference between dividing by $n$ and by $n-1$, and the two estimators give similar results.

Example 4-Separating a Mixture of Gaussians Consider the case where the data comes from a mixture of Gaussians, meaning that some fraction of the data comes from an unknown Gaussian distribution 1 and the rest of the data comes from another Gaussian distribution 2. In this case, the pdf of the unknown mixture is $w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}),$ where $w_1, w_2$ are the unknown fractions (and so $w_1 \geq 0,$ $w_2 \geq 0,$ and $w_1 + w_2 = 1$) and $f_1$ and $f_2$ are the pdfs of the two unknown Gaussians. This is the simplest case of heterogeneous data. For example, consider data representing financial records of people from two countries with different standards of living and income levels, and assume that for each of the countries those data follow a Gaussian distribution. While some financial records (e.g., lower income ones) in the high-income country may be similar to some financial records (e.g., higher income ones) in the low-income country, in general, "typical" financial records would be most similar to other financial records from the same country.

Now, if we were given a set of financial records $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ and labels for each of these vectors telling us which country they belonged to, the estimation of the unknown parameters would boil down to empirical estimates of $w_1, w_2$ and empirical estimates for the parameters of the two Gaussian distributions carried out as in the previous example. However, if we do not have such labels, the main difficulty becomes assigning each of the vectors to one of the two populations. If we tried writing the log-likelihood and maximizing it directly, this would unfortunately lead to an optimization problem that we do not know how to solve efficiently. In full generality, MLE for this problem is known to be computationally intractable (more specifically, it is NP-hard). However, the specific variant we will consider below is tractable when solved approximately, as we will argue.

To make things more concrete, suppose that the two unknown distributions are $\mathcal{N}(\boldsymbol{\mu}_1, \mathbf{I})$ and $\mathcal{N}(\boldsymbol{\mu}_2, \mathbf{I})$, where the mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ are unknown. Denote by $\Delta := \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2$ the distance between the two means. The main idea is as follows: we can use the Gaussian Annulus Theorem to argue that when two vectors come from the same Gaussian distribution, they are very likely to be within a $\sqrt{2}(\sqrt{d} + t)$ distance for some small $t$ (we'll see that $t$ of the order $\sqrt{\log(n)}$ is pretty good). When they come from different Gaussians whose means are separated by a sufficiently large $\Delta$, they are highly unlikely to be within the distance $\sqrt{2}(\sqrt{d} + t)$. The separation algorithm is then simple: pick one of the data vectors (a random one or the first one would do), compute its pairwise distances to all other data vectors, assign all data vectors within the distance $\sqrt{2}(\sqrt{d} + t)$ to the same cluster/distribution, and assign the rest to the second cluster/distribution (a small simulation of this rule appears at the end of this example). It should be intuitively clear that when $\Delta$ is at least order-$\sqrt{d}$, the approach above will successfully cluster almost all of the data vectors (with high probability), as in this case the two Gaussians are highly unlikely to have any overlap (due to the Gaussian Annulus Theorem).
However, a more careful analysis that we will carry out below shows that it suffices to choose $\Delta$ that is at least of the order $d^{1/4}\sqrt{\log(n)}$, which is generally much smaller than $\sqrt{d}.$ There is an even better argument that allows for constant separation (where $\Delta$ is allowed to be a constant independent of $d$, but possibly scaling with $\sqrt{\log(n)}$), but it goes beyond the scope of this course.

Let us first determine $t > 0$ that ensures all the data vectors from the same Gaussian distribution are within the distance $\sqrt{2}(\sqrt{d} + t)$ of each other. Consider any two data vectors from the dataset that we'll denote as $\mathbf{x}_1$ and $\mathbf{x}_2$. If they are drawn from the same Gaussian distribution (either $\mathcal{N}(\boldsymbol{\mu}_1, \mathbf{I})$ or $\mathcal{N}(\boldsymbol{\mu}_2, \mathbf{I})$), then $\frac{\mathbf{x}_1 - \mathbf{x}_2}{\sqrt{2}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (why?). As a consequence, applying the (one-sided version of the) Gaussian Annulus Theorem to $\frac{\mathbf{x}_1 - \mathbf{x}_2}{\sqrt{2}},$ we have $$ \mathbb{P}\big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \geq \sqrt{2}(\sqrt{d} + t)\big] \leq e^{-t^2/8}. $$ Using the union bound, the probability that there exists a pair of vectors in the dataset that come from the same Gaussian distribution but are at a distance larger than $\sqrt{2}(\sqrt{d} + t)$ is at most $$ n^2 e^{-t^2/8}, $$ where we used a loose bound $n^2$ on the number of possible pairs of indices from $\{1, \dots, n\}.$ Setting $n^2 e^{-t^2/8} \leq \delta$ for some small $\delta$ (e.g., $\delta \approx \frac{1}{n}$), we get that for $t = \mathrm{constant}\cdot \sqrt{\log(n/\delta)}$ (which for $\delta \approx \frac{1}{n}$ is $\mathrm{constant}\cdot \sqrt{\log(n)}$) we can ensure that: _With probability at least $1-\delta$, all data vectors generated from the same Gaussian are within distance $\sqrt{2}(\sqrt{d} + t)$ of each other._

What we have proved so far is, of course, not enough, as it can still happen that data vectors from different Gaussians are also close to each other. This is where ensuring that the separation $\Delta$ between the two Gaussian means is large enough becomes crucial. In particular, what we need to argue is that if two data vectors are generated from different Gaussians, they are highly unlikely to be within distance $\sqrt{2}(\sqrt{d} + t)$ of each other. To do so, we argue that any fixed pair of data vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ drawn from two different Gaussians satisfies $$ \mathbb{P}\big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\big] \leq \frac{\delta}{n^2}. $$ Then, by a union bound, we can conclude that any two vectors that belong to different Gaussians must be at distance larger than $\sqrt{2}(\sqrt{d} + t)$, with probability at least $1 - \delta.$ By a union bound again, the algorithm we outlined at the beginning correctly assigns data vectors to the two Gaussians with probability at least $1 - 2\delta.$

Let us now argue that for $\Delta$ sufficiently large (order-$d^{1/4}\sqrt{\log(n/\delta)}$ works) we indeed have that any fixed pair of data vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ drawn from two different Gaussians satisfies $$ \mathbb{P}\big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\big] \leq \frac{\delta}{n^2}. $$ Concretely, suppose (without loss of generality) that $\mathbf{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \mathbf{I})$ and $\mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \mathbf{I})$.
Observe that $\frac{\mathbf{x}_1 - \mathbf{x}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}{\sqrt{2}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (why?). When we proved the Gaussian Annulus Theorem, we in fact proved that for $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have $$ \mathbb{P}\big[\|\mathbf{z}\|_2^2 \leq d - s\sqrt{d}\big] \leq e^{-s^2/8}. $$ As we have discussed above for $t$, for $s = \mathrm{constant}\cdot \sqrt{\log(n/\delta)}$, the above probability can be made smaller than $\frac{\delta}{2 n^2}.$

We want to apply this inequality to $\mathbf{z} = \frac{\mathbf{x}_1 - \mathbf{x}_2 - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}{\sqrt{2}},$ but what we actually want to bound is $\mathbb{P}[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)],$ so we need to relate $\|\mathbf{x}_1 - \mathbf{x}_2\|_2$ and $\|\mathbf{z}\|_2.$ Squaring both sides of $\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)$ and expanding the square in the definition of $\|\mathbf{z}\|_2^2,$ it is not hard to show that $$ \|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\quad\quad \Leftrightarrow \quad\quad\|\mathbf{z}\|_2^2 \leq d + 2\sqrt{d}t + t^2 + \frac{1}{2}\Delta^2 - (\mathbf{x}_1 - \mathbf{x}_2)^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), $$ where we have used that by definition $\Delta = \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2$. Now if we could argue that the right-hand side of the inequality on the right is at most $d - s\sqrt{d}$ for $s = \mathrm{constant}\cdot \sqrt{\log(n/\delta)}$, we would be done by the variant of the Gaussian Annulus Theorem we stated above for $\mathbf{z}.$ Unfortunately, the right-hand side contains $- (\mathbf{x}_1 - \mathbf{x}_2)^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),$ which is itself a random variable. Here we use the same trick we used before: we condition on an event $\mathcal{E}$ that $- (\mathbf{x}_1 - \mathbf{x}_2)^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is of the order we need to prove the statement, and argue that the probability of the complement of that event is small. Then, similar to what we did in past lectures, \begin{align} \mathbb{P}\big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\big] &\leq \mathbb{P}\Big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\Big|\;\mathcal{E}\Big]\mathbb{P}[\mathcal{E}] + \mathbb{P}\Big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\Big|\;\mathcal{E}^c\Big]\mathbb{P}[\mathcal{E}^c]\\ &\leq \mathbb{P}\Big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\Big|\;\mathcal{E}\Big] + \mathbb{P}[\mathcal{E}^c]. \end{align} (What did we use here?) In particular, if the event we choose is $$\mathcal{E} := \Big\{- (\mathbf{x}_1 - \mathbf{x}_2)^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \leq - \frac{3}{4} \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2^2\Big\},$$ then $$ \mathbb{P}\Big[\|\mathbf{x}_1 - \mathbf{x}_2\|_2 \leq \sqrt{2}(\sqrt{d} + t)\Big|\;\mathcal{E}\Big] \leq \mathbb{P}\Big[\|\mathbf{z}\|_2^2 \leq d + 2\sqrt{d}t + t^2 - \frac{1}{4}\Delta^2\Big]. $$ As we discussed above, when $2\sqrt{d}t + t^2 - \frac{1}{4}\Delta^2 \leq -s\sqrt{d}$ for $s = \mathrm{constant}\cdot \sqrt{\log(n/\delta)}$, which occurs for $\Delta \gtrsim d^{1/4}\sqrt{\log(n/\delta)},$ the above probability is bounded by $\frac{\delta}{2n^2},$ which is what we are shooting for.
It remains to bound the probability $\mathbb{P}[\mathcal{E}^c] = \mathbb{P}\big[- (\mathbf{x}_1 - \mathbf{x}_2)^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) > - \frac{3}{4} \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2^2\big]$ by $\frac{\delta}{2 n^2}$. You can do this as an exercise by arguing that $(\mathbf{x}_1 - \mathbf{x}_2)^\top (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ follows a (univariate) Gaussian distribution and applying a concentration bound for univariate Gaussians.
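To see the clustering rule from this example in action, here is a small simulation sketch (NumPy assumed). The constants in the choices of $\Delta$ and $t$ below are illustrative assumptions; the analysis above only pins them down up to constant factors.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10_000

# Separation of order d^(1/4) sqrt(log n), noticeably smaller than sqrt(d) here.
delta_sep = 3 * d**0.25 * np.sqrt(np.log(n))
mu1 = np.zeros(d)
mu2 = np.zeros(d)
mu2[0] = delta_sep                          # so that ||mu1 - mu2||_2 = delta_sep

# Generate n points, each from one of the two Gaussians with identity covariance.
labels = rng.integers(0, 2, size=n)
X = np.where(labels[:, None] == 0, mu1, mu2) + rng.standard_normal((n, d))

# Distance-based clustering: compare every point to the first one and threshold
# at sqrt(2) * (sqrt(d) + t) with t of order sqrt(log n).
t = 2.5 * np.sqrt(np.log(n))
threshold = np.sqrt(2) * (np.sqrt(d) + t)
dists = np.linalg.norm(X - X[0], axis=1)
cluster = (dists > threshold).astype(int)   # 0 = same cluster as X[0], 1 = the other

# Agreement with the true labels (up to relabeling of the two clusters).
agreement = max(np.mean(cluster == labels), np.mean(cluster != labels))
print(agreement)  # close to 1.0 when the separation is large enough
```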

4.2 Non-Parametric Estimation

So far we have discussed the estimation of parametric distributions, where estimating a distribution amounts to estimating the unknown parameter(s). However, there are cases where we cannot assume that a distribution is parametric, either because there is no simple parametric form or because we simply do not know of one. As a concrete example, let us consider estimating a distribution that is supported on $k \geq 1$ elements (recall that you saw one such example for the GPA distribution in HW $\#2$). You can also think of this example as estimating a histogram of a one-dimensional distribution, where we have pre-decided on the bins of the histogram. Without loss of generality, we will denote the $k$ elements of the support of this unknown distribution by $1, 2, \dots, k.$

Before delving into more detail, we first need to define what a good estimate is for us. As before, we would like to have a confidence interval-type guarantee with parameters $\epsilon, \delta \in (0, 1)$, where we can claim that the estimated distribution $\mathbf{\hat{p}} = (\hat{p}_1, \hat{p}_2, \dots, \hat{p}_k)$ is "$\epsilon$-close" to the target distribution $\mathbf{p} = (p_1, p_2, \dots, p_k)$ with probability (confidence) $1 - \delta.$ But what does "$\epsilon$-close" even mean? To make the statement precise, we need to define how we measure the distance between two distributions. In HW $\#2$, we were satisfied with ensuring that for each bin $i \in \{1, 2, \dots, k\},$ we had that $|p_i - \hat{p}_i| \leq \epsilon.$ However, when $k$ is large, this may not be a sufficiently good estimate, as the error we may be making on large sets (spanning order-$k$ bins/elements) can be of order $k \epsilon$. There are many other notions of distances between probability distributions, but the one we will use here is perhaps the most commonly used one. It is called the total variation (TV) distance, and is defined as the maximum disagreement between the distributions over all possible subsets of $\{1, 2, \dots, k\}$. In particular,

$$ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) := \max_{S \subset \{1, 2, \dots, k\}}(p(S) - \hat{p}(S)). $$

It is possible to argue that in this case we also have

$$ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) = \frac{1}{2}\sum_{i=1}^k |p_i - \hat{p}_i|. $$

The quantity $\sum_{i=1}^k |p_i - \hat{p}_i|$ is also a norm known as the $\ell_1$ norm (or Manhattan distance), denoted by $\|\mathbf{p} - \mathbf{\hat{p}}\|_1 = \sum_{i=1}^k |p_i - \hat{p}_i|$.

Proving that $ d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) = \frac{1}{2}\sum_{i=1}^k |p_i - \hat{p}_i|$ is not hard, so let us do this quickly. We start by looking at the sum $\sum_{i=1}^k (p_i - \hat{p}_i)$ and breaking it into two sums: one over the indices where $p_i > \hat{p}_i$ and one over the indices where $p_i < \hat{p}_i$ (if $p_i = \hat{p}_i$, the summand is zero and we can ignore it). In particular, because $\sum_{i=1}^k p_i = \sum_{i=1}^k \hat{p}_i = 1$ (both are probability distributions), we have

$$ 0 = \sum_{i=1}^k (p_i - \hat{p}_i) = \sum_{i: p_i > \hat{p}_i}(p_i - \hat{p}_i) + \sum_{i': p_{i'} < \hat{p}_{i'}}(p_{i'} -\hat{p}_{i'}). $$

Now, all the summands in $\sum_{i: p_i > \hat{p}_i}(p_i - \hat{p}_i)$ are positive, so we can write $\sum_{i: p_i > \hat{p}_i}(p_i - \hat{p}_i) = \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i|$. On the other hand, all the summands in $\sum_{i': p_{i'} < \hat{p}_{i'}}(p_{i'} -\hat{p}_{i'})$ are negative, and so $\sum_{i': p_{i'} < \hat{p}_{i'}}(p_{i'} -\hat{p}_{i'}) = - \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}|.$ As a result,

$$ \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i| = \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}|. $$

Now, the set $S$ that maximizes the disagreement $p(S) - \hat{p}(S)$ is precisely $S^* := \{i: p_i > \hat{p}_i\}$: including any $i$ with $p_i \leq \hat{p}_i$ could only decrease (or leave unchanged) the value of $p(S) - \hat{p}(S)$, while excluding any $i$ with $p_i > \hat{p}_i$ would strictly decrease it. Thus $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) = p(S^*) - \hat{p}(S^*) = \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i|,$ and since we have argued that $ \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i| = \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}|,$ we finally conclude that

$$ \sum_{i=1}^k |p_i - \hat{p}_i| = \sum_{i: p_i > \hat{p}_i}|p_i - \hat{p}_i| + \sum_{i': p_{i'} < \hat{p}_{i'}}|p_{i'} -\hat{p}_{i'}| = 2 d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}), $$

as claimed.
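When $k$ is small, the identity we just proved can also be checked numerically by brute force over all $2^k$ subsets. Here is a short sketch (NumPy assumed; the two distributions are arbitrary examples).

```python
import numpy as np
from itertools import chain, combinations

# Two arbitrary example distributions on k = 5 elements.
p = np.array([0.10, 0.30, 0.20, 0.25, 0.15])
p_hat = np.array([0.20, 0.20, 0.20, 0.20, 0.20])
k = len(p)

# TV distance via the definition: max over all subsets S of p(S) - p_hat(S).
subsets = chain.from_iterable(combinations(range(k), r) for r in range(k + 1))
tv_by_definition = max(p[list(S)].sum() - p_hat[list(S)].sum() for S in subsets)

# TV distance via the ell_1 identity: (1/2) sum_i |p_i - p_hat_i|.
tv_by_l1 = 0.5 * np.abs(p - p_hat).sum()

print(tv_by_definition, tv_by_l1)  # both equal 0.15 for this example
```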

To recap, given error parameters $\epsilon, \delta \in (0, 1)$ and sample access to an unknown distribution $\mathbf{p}$ supported on $k$ elements, our goal is to find an estimate $\mathbf{\hat{p}}$ (supported on the same $k$ elements) so that, with probability at least $1 - \delta,$ it holds that $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon.$

We will use the same (and quite natural) estimator for this problem as before: the empirical estimator (which we have justified before via the weak law of large numbers). In more detail, given samples $X_i,$ $i = 1, 2, \dots, n,$ we let $Y_{ij}$ be the Bernoulli random variable that is equal to one if $X_i = j$ and equal to zero otherwise. Our estimate is then $\hat{p}_j = \frac{\sum_{i=1}^n Y_{ij}}{n}$ for each $j \in \{1, 2, \dots, k\}.$
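Here is a minimal sketch of this empirical estimator (NumPy assumed; the true distribution below is an arbitrary example), which also reports the resulting TV error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary example distribution on k = 6 elements, labeled 1, ..., k.
p = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])
k, n = len(p), 10_000

# Draw n samples X_1, ..., X_n from p and form the empirical estimate
# p_hat_j = (number of samples equal to j) / n.
samples = rng.choice(np.arange(1, k + 1), size=n, p=p)
p_hat = np.bincount(samples, minlength=k + 1)[1:] / n

print(p_hat)
print(0.5 * np.abs(p - p_hat).sum())  # TV error; shrinks as n grows
```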

It is possible to bound the sample complexity of this estimator using Chebyshev's inequality, and you can do this as an exercise. The resulting number of required samples $n$ we would get this way is of the order $\frac{k}{\delta \epsilon^2}.$ Alternatively, using results from HW $\#2,$ we could use Hoeffding's bound to ensure $|p_j - \hat{p}_j| \leq 2\epsilon/k$ for every bin $j$, which would immediately lead to $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon$ using the union bound. The required number of samples we would get this way is of the order $\frac{k^2 \log(k/\delta)}{\epsilon^2},$ which gives us a better dependence on $\delta$ than the Chebyshev bound, but a worse dependence on $k$. You may be wondering now if it is possible to get the best of both: linear dependence on $k$ and logarithmic dependence on $1/\delta.$ It turns out that the answer is "yes," but we need to be more careful about how we carry out the argument using Hoeffding's bound.

To obtain the tightest bound on the sample complexity (which turns out to be the best possible, though we won't prove that), we use the original definition of the TV distance, $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) := \max_{S \subset \{1, 2, \dots, k\}}(p(S) - \hat{p}(S)).$ Consider first any subset $S$ of $\{1, 2, \dots, k\}.$ The random variable $\hat{p}(S) = \sum_{j \in S} \hat{p}_j = \frac{1}{n}\sum_{i=1}^n \sum_{j \in S} Y_{ij}$ is an average of $n$ independent random variables $\sum_{j \in S} Y_{ij}$ (one per sample $X_i$), each of which indicates whether $X_i \in S$ and is thus bounded between 0 and 1. Moreover, by linearity of expectation, $\mathbb{E}[\hat{p}(S)] = p(S).$ Applying the (two-sided) Hoeffding bound, we thus have

$$ \mathbb{P}[|\hat{p}(S) - p(S)| \geq \epsilon] \leq 2 e^{-2 \epsilon^2 n}. $$

Now, stating that $d_{\rm TV}(\mathbf{p}, \mathbf{\hat{p}}) \leq \epsilon$ holds with probability at least $1 - \delta$ is the same as saying that, with probability at least $1 - \delta,$ we have $p(S) - \hat{p}(S) \leq \epsilon$ for every possible set $S$; this is in particular implied if $|\hat{p}(S) - p(S)| \leq \epsilon$ for every $S$. Thus, it suffices to show that the probability that there exists at least one set $S$ such that $|\hat{p}(S) - p(S)| \geq \epsilon$ is at most $\delta.$ Since there are $2^k$ different subsets of $\{1, 2, \dots, k\},$ applying the union bound we get that

$$ \mathbb{P}[\exists S \subset \{1, 2, \dots, k\}: |\hat{p}(S) - p(S)| \geq \epsilon] \leq 2^k \cdot 2 e^{-2 \epsilon^2 n}. $$

Now, to find the required number of samples, it remains to require $2^k \cdot 2 e^{-2 \epsilon^2 n} \leq \delta$ and solve for $n$. Doing so, we get $n \geq \frac{k \log(2) + \log(2/\delta)}{2 \epsilon^2}$ and thus $n = \left \lceil \frac{k \log(2) + \log(2/\delta)}{2 \epsilon^2} \right\rceil$ samples suffice.
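As a quick numeric illustration of this bound (the values of $k$, $\epsilon$, and $\delta$ below are arbitrary), the sufficient number of samples can be computed directly:

```python
import math

def sufficient_samples(k, eps, delta):
    # n = ceil((k log 2 + log(2/delta)) / (2 eps^2)), as derived above.
    return math.ceil((k * math.log(2) + math.log(2 / delta)) / (2 * eps**2))

# Example: k = 100 bins, accuracy eps = 0.05, confidence 1 - delta = 0.99.
print(sufficient_samples(100, 0.05, 0.01))  # 14923
```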

Bibliographic Notes

This lecture was prepared using (i) a lecture note on MLE from MIT 18.05 as taught by Orloff & Bloom in 2005 (for the discrete MLE example), (ii) Chapter 2 of the book by Blum-Hopcroft-Kannan (for the continuous MLE examples), and (iii) a note by Clément Canonne (arXiv:2002.11457, for the last section on non-parametric estimation). The specific derivations are my own and adapted to agree with the rest of the course material.